\documentclass{llncs}
\bibliographystyle{plain}
\usepackage{graphicx}
\usepackage{float}
%
\begin{document}
\title{Genetic Algorithm Optimization in Maze Solving Problem}
\author{Thomas Pasquier, Julien Erdogan}
\institute{Institut Superieur d'Electronique de Paris\\ \email{thomas.pasquier@isep.fr} \email{julien.erdogan@isep.fr}}
	
\maketitle


\begin{abstract}
In this essay, we studied crossing selection and local optimization for Genetic Algorithm. In particular we will study how to enhance performance of Genetic Algorithm for Maze Resolution Problem. We will study the impact of different operators, crossover and mutation, over the diversity and the performance. We will also try to determine the parameters which provides the best results.
\end{abstract}


\section{Introduction}
%
Genetics is the science which deals with the molecular structure and the function of genes. It studies the behaviour of genes in the context of cells or organism, the pattern of inheritance from parents to offspring and the genes distribution, variation and change in a population. In nature genes correspond of four difference DNA nucleotides sequences encoding a particular behaviour. If genes play a large role in the definition of an individual, it is the combination of genetics and individual experience that determines the ultimate outcome. More fitted individuals had more chance to survive and transmit their genes, so the population slowly evolves towards more fitted individuals. Genetic mutations generate new behaviours which if positive may be transmitted from generation to generation and become a dominant feature.\\

Genetic algorithm is an heuristic algorithm that mimics natural evolution process. It's an algorithm commonly used to generate solutions to optimization and search problems. A population of candidate solution to an optimization or search problem called individuals encoded by byte sequences called genome evolve towards an optimal solution. At each generation, the fitness of each individuals is evaluated, and some individuals are stochastically selected(based on their fitness) and modified (combined and randomly mutated) to generate a new population. This new population is used in the next iteration of the algorithm. The algorithm terminate when a maximum number of iteration is reached, a satisfactory solution is found or the genes converge.\\

To implement a genetic algorithm one needs to define a fitness function to evaluate the solution domain and a genetic representation of this solution domain. The fitness function return a single numerical fitness, which is supposed to be proportional to the utility or the ability of the individual that chromosome represents\cite{overview11993}.\\

When a Genetic Algorithm is correctly implemented, populations evolved such that the average fitness and the best individual fitness in each generation increase towards an optimum value. Convergence is the progression towards increasing uniformity. A gene is said to converge when 95\% of the population share the same value\cite{dej75}.\\

It is very hard to know what values of GA parameters (such as population size, mutation operator, probability of mutation, etc.) to use to solve a problem. This problem has been investigated by Kalyanmoy Deb and Samir Agrawal\cite{Deb99understandinginteractions}, they highlighted the following facts :\\
- Population either too small or too large are detrimental. For a population too small the number of generation required to solve a problem is too large. If the population is too large, the number of generation needed to find an optimum is too large.\\
- GA with crossover and mutation operator perform better than implementation using only crossover or only mutation.\\
- For complex problem with multi-modality\footnote{Those category of problem contain a large number of false attractor.} and high chance of deception the crossover operator is the most important factor as mutation alone fail miserably to solve those kind of problems.\\

Deception is formally described\cite{overview21993} as : when the average of fitness of schemata which are not contained in the global optimum is greater than the average of those which are. A problem is defined as fully deceptive if all low-order schemata containing a suboptimal solution are better than other competing schemata. Deceptive problem are very hard to solve.\\

Our problem fall under the category of problem with multi-modality and high deception rate. So, we should put a strong emphasis on the cross-over operator while the mutation operator will play a less important role.
%

\section{Rules}
%	
Our maze is composed of 4 types of cases:\\
-	maze entrance\\
-	maze exit\\
-	path\\
-	wall\\
The entrance and the exit are situated respectively at the top left corner and the bottom right one(see fig 1.). A path can't form what we call rooms (structures formed of m*n contiguous path, where n > = and m > = 2 ).

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/maze.png}
\label{fig=maze}
\caption{Representation of a maze}
\end{figure}
%

\section{How to measure genetic algorithm performance}
%
\subsection{Diversity}

Diversity allow us to measure the current level of convergence. The diversity is the number of gene different between two individuals divided by the total number of genes. The closer the diversity is to 0, the closer we are from a solution (being the optimal solution or a suboptimal one). Diversity is one of the most important factors, the process is said to converge when the individuals of the gene pool are identical or near to be so. Once this event occurs, the crossover operator cease to generate new individuals, the algorithm allows all its trials to a very small sub-space. Sadly, it may occurs before the true optimum as been found, it's called premature convergence\cite{diversity1984}.

\subsection{On-line and Off-line Performance}

There is two common performance measures used to evaluate the effectiveness of function optimizer : the on-line and off-line performance. On-line performance is the mean of all trials (or generation), while off-line performance is the mean of the best individual of each generation\cite{diversity1984}. On-line performance is suited when the learning process is done while performing the task such as gambling or economics. Off-line performance only evaluates the best individual and is better suited to problem whose objective is to find a solution or to create a model.

\begin{figure}[H]
\center
\includegraphics[scale=0.6]{images/online_offline_equation.png}
\label{fig=online_offline_equation}
\caption{Equation of on-line and off-line performance.}
\end{figure}

In our case, the most important measure will be the off-line performance as we are trying to find a solution for our maze.
%

\section{Diversity, Mutation and Crossover Operator}
%
During the reproductive phase of a GA, individuals are selected among the population and recombined producing offspring which will compose the next generation. Parents are selected randomly from the population using a scheme which favours the fittest individuals (good individuals may reproduce several time, while poor individual may never reproduce)\cite{overview11993}.\\

When two parents has been selected, their chromosomes are recombined by using the mechanisms of crossover and mutation. Crossover is basically taking parts of both parent genome to create a new offspring genome. Mutation alter each gene with a small probability.\\

\subsection{Many Villages Algorithm}

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/manyvillages.png}
\label{fig=manyvillages}
\caption{How many villages algorithm works}
\end{figure}

Many Villages Algorithm divide the population in several separated subgroups, called villages, which merge after a certain number of generation (as described in the above schema).\\

It has been demonstrated\cite{segrega2002,segrega2001} that Many Villages Algorithm or SEGA algorithm (for Segregative Genetic Algorgrithm) avoids too early convergence by preserving more genetic diversity. Many Villages Algorithm also allow to parralelize computation and reduce computation runtime for an equivalent sized population.\\

Many Villages algorithm is inspired by the natural phenomena of the emergence of sub-species if sub-population of a same species are separated from each other.\\

We discussed the diversity measure earlier. In case of many villages we consider the diversity to be the average of the diversity of all villages.\\

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/villages_graph.png}
\label{fig=villages_graph}
\caption{Diversity in regard of the number of villages}
\end{figure}

We see clearly here that many villages algorithm maintain a rather high diversity over a longer period. With the same population size,  when the population is not divided into village (normal GA implementation) convergence occurs in around 40 generations, while with this same population is divided between sixteen villages we converge much later.\\

An early convergence may occur inside a village, but when those village are merged the diversity grew up again allowing to eventually find a better solution to the problem.\\

It is also interesting to note that on the same machine using the same operators, it needs 20 minutes to compute 2000 generation if the population is not divided, while it takes only 6 when there is 16 villages.\\
%

\subsection{Crossover Operator}

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/simple_cross_algorithm.png}
\label{fig=simple_cross_algorithm}
\caption{Java implementation of Improved Segment Crossover Operator.}
\end{figure}

Simple Crossover Operator randomly choose each gene from the mother or the father.\\

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/segment_cross_algorithm.png}
\label{fig=segment_cross_algorithm}
\caption{Java implementation of Segment Crossover Operator.}
\end{figure}

Segment Crossover Operator divide the genome in three parts (of random size).  The offspring take segment from the mother or the father randomly.\\

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/improved_cross_algorithm.png}
\label{fig=improved_cross_algorithm}
\caption{Java implementation of Improved Segment Crossover Operator.}
\end{figure}

Improved Segment Crossover operator behave like the Segment Crossover Operator except that the first segment size is the size of the best working genes sequence between the two parents.

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/cross_operator_offline.png}
\label{fig=cross_operator__offline}
\caption{Cross over operator off-line performance.}
\end{figure}

It appears clearly that the "`Improved Segment Crossover Operator"' give the best off-line performance as it does not destroy working genome.\\

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/cross_operator_diversity.png}
\label{fig=cross_operator_diversity}
\caption{Influence of the crossover operator on diversity.}
\end{figure}

It is also clear that Crossover operator based on segment algorithm maintain an higher diversity than simple crossover algorithm. A simple conjecture is that simple crossover operator generates a lot of not working genome, while segment based crossover operator preserve working genes sequence.\\



\section{Local Optimization}
%
As mentioned by Kalyanmoy Deb and Samir Agrawal\cite{Deb99understandinginteractions}, Cross Over Operator is the most important operator in multi-modal problems with high deception. The role played by mutation is very low as mutations are likely to create bad individuals which would not survive. From this fact, we decided to introduce Local Optimization which modify genome using  knowledge on the problem.\\

\subsection{Change Last Operator}
\begin{figure}[H]
\center
\includegraphics[scale=0.6]{images/change_last_algorithm.png}
\label{fig=change_last_algorithm}
\caption{Java implementation of the change last operator.}
\end{figure}

The purpose of this algorithm is to replace the first wrong gene, the one which lead to a wall, by a gene which lead to a valid path into the maze.

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/change_last_graph.png}
\label{fig=change_last_graph}
\caption{Change-last operator influence on off-line performance.}
\end{figure}

By using this algorithm we see a clear improvement. It's due to the simple fact that each generation is very likely to see a more fitted individual than those in the previous generation.

\subsection{Out of Dead End Operator}

\begin{figure}[H]
\center
\includegraphics[scale=0.6]{images/dead_end_algorithm.png}
\label{fig=dead_end_algorithm}
\caption{Java implementation of the out of dead end operator.}
\end{figure}

Firstly, the algorithm check if the genes sequence leads to a dead end, then if it is the case it identify any other potential path, choose one randomly and modify the genome to take this path.\\

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/out_of_dead_end_graph.png}
\label{fig=out_of_dead_end_graph}
\caption{Out of dead end operator influence on off-line performance.}
\end{figure}

We can see that Out Of Dead End Operator has no influence on performance used alone.\\

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/change_last_dead_end_graph.png}
\label{fig=change_last_dead_end_graph}
\caption{Comparison between Change Last Operator applied alone and Change Last + Out Of Dead End operators}
\end{figure}

Change Last Operator seems to decrease the performance achieved by our genetic algorithm. However this algorithm should not be ignored as empirical test prove that it seems to help to find the optimum solution and avoid a lot of sub-optimal one. A new test protocol may need to be created to evaluate its impact over the problem resolution.\\

\subsection{Remove Loop Operator}

\begin{figure}[H]
\center
\includegraphics[scale=0.6]{images/remove_loop_algorithm.png}
\label{fig=remove_loop_algorithm}
\caption{Java implementation of the remove loop operator.}
\end{figure}

In this algorithm, the genome is explored to detect loop. We construct a vector by browsing to all the genes, when a unit vector is detected we remove the sequence of gene constructing this vector and replace it by a single gene which equals to the same unit  vector.\\

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/change_last_remove_loop_graph.png}
\label{fig=change_last_remove_loop_graph}
\caption{Comparison between Change Last Operator applied alone and Change Last + Remove Loop Operators}
\end{figure}

Remove loop as a slight effect on algorithm performance. A complexity study should be made to evaluate if the CPU consumption is worth the slight improvement.\\

\subsection{Conclusion}

We have seen that local optimization are more effective than a simple mutation. For the rest of this paper we will consider as the most effective algorithm the following:\\
- Cross over operator : Improved Segment Cross Over\\
- Improvement / Mutation Operator : Change Last Operator + Remove Loop Operator + Out Of Dead End Operator\\

%
\section{Selecting the alpha parameters}

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/ftiness_algorithm.png}
\label{fig=fitness_algorithm}
\caption{Java Implementation of the Fitness Algorithm.}
\end{figure}

Our fitness function rely on two factor. One is the distance from the shortest path possible, the second one is the distance from the exit.\\

The alpha parameter influence how the fitness value is computed. Fitness value has direct influence over the diversity. The most effective way to evaluate if our alpha value as been well chosen is to vary this alpha value and to verify which one provide us with the highest diversity over time.\\

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/alpha_diversity.png}
\label{fig=alpha_diversity}
\caption{Influence of alpha parameter on diversity.}
\end{figure}

We see that after the first hundred generations the diversity vary between 15\% and 25\%. An alpha parameters of 0.9 seems to offer the highest diversity over the whole period. We will compare how the off-line performance vary with the four alpha parameters offering the highest diversity.\\

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/alpha_off.png}
\label{fig=alpha_off}
\caption{Influence of alpha parameter on off-line performance.}
\end{figure}

We cannot compare the value directly, we have to compare the shape of the curve. We can see that all the curve have exactly the same shape. We decided to choose an alpha parameters of 0.7, as it does not put too many emphasis on one of the two factor of our fitness function and provides good performance.\\

\section{Population size}

So far we have selected our operators,our alpha parameter, we now have to select the optimum population size. We will study the influence of the population size on the diversity and the off-line performance. We will test with twenty villages of the following size each : 10, 20, 50, 100. A too large population will prevent a convergence to happen, a too small one will decrease the diversity and lead to an early convergence.\\

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/size_diff.png}
\label{fig=size_diff}
\caption{Influence of population size on diversity.}
\end{figure}

We can see that the larger the population, the higher the diversity. For an unknown reason with a Population of size 20, the diversity curve has a totally different shape (reproducible phenomena).\\

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/size_perf.png}
\label{fig=size_perf}
\caption{Influence of population size on off-line performance.}
\end{figure}

The larger the population, the better the performance. It is also important to note that computation time is increased by the size of the population (a very simple estimation of the complexity is O(m*n*n) with m the number of village and n the population size). A population size of 50 seems to be a good compromise as we see that a population of 100 does not increase performance or diversity that much.\\

\section{Conclusion}

\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/final_graph.png}
\label{fig=final_graph}
\caption{Comparison between the basic algorithm and the final solution.}
\end{figure}

The curve above demonstrate the clear performance improvement we have been able to make through our study. The performance has been multiplied by 4. Even if we cannot find a solution to all maze, we are way more likely to find one in a reasonable time.\\

The code developed during this project allow to perform a large set of test, to study impact of cross operator, mutation operator and many other factor. Some simple development can make it even more powerful (configuration of maze generator algorithm, configuration of mating algorithm etc.) and it can become the root of a powerful tool to test genetic algorithm in maze resolution. Further study could focus on such parameters and would increase performance even more.\\

An other path potentially interesting to explore is the parallelization of the matting process. It could be easily by using MapReduce\cite{Dean:2008:MSD:1327452.1327492} as follow :\\
- Input Key : village-id, Input Value : genome\\
- Intermediate Key village-id, Intermediate Value : genome\\
- Output Key village-id, Output Value : genome\\

The Map operator would filter the input using a fittest threshold (defined by the average fittest score of the previous generation). The Reduce operator would cross the genome who survived through the Map operator and generate the new generation put in the Output file. We call this algorithm recursively. It would without a doubt increase the performance by a very large factor.\\

An other interesting approach would be to create initial villages with different operators and parameters take those into consideration when crossing two individuals. Genetic selection may eliminate under performing combination and lead to operators and parameters combination fitted to the really maze we are trying to solve. It may lead to a two level genetic algorithm, one level will be the actual solution and the second a genetic algorithm over the parameter influencing the generation of the first one. In An Overview of Genetic Algorithms\cite{overview21993}, advanced crossover operator learn where the crossover operator should be most probable to happened (which is a more evolved and less systematic approach of our \"Improved Segment Crossover\"), other example of dynamic operator are described in the same paper.\\

Finally further study of the algorithms used here could lead to simplification and complexity reduction, if done it makes no doubt that it will increase the performance by an important factor. 

\section{Note}
%
Implementation sources are available through subversion at :\\ 
https://geneticmaze.googlecode.com/svn/trunk/\\
Manual, application and other documents are available here : \\
http://code.google.com/p/geneticmaze/downloads/list\\

\subsection{How to use the program}
To test the algorithms we used a Java program. To launch the program you need to install Java first. There is a link to download the installer for Windows : http://www.java.com/fr/download/.

The program can use an xml file to perform the algorithms you want to solve the maze. This file is set as followed.\\
\begin{figure}[H]
\center
\includegraphics[scale=0.4]{images/config_file.png}
\label{fig=config_file}
\caption{Default configuration XML file}
\end{figure}

The XML config file contains the cross behaviour, the improvements behaviours, the mutation behaviour, the fitness behaviour and the alpha value. If no xml file is passed to the program, it will use the default configuration XML file shown above.\\
To launch the program you must use the following line in the windows command prompt : \\
$ java -jar Maze.jar ConfigurationFile.xml $
%
\bibliography{bibliography}
\end{document}